Analyzing CMPD Traffic Stops
In this post, we’ll examine a data set of stops by the Charlotte-Mecklenburg Police Department (CMPD).
Our focus will be to understand what factors are related to whether someone is searched or not for a traffic stop.
# Loading the libraries
library(dplyr)
library(tidyverse)
library(scales)
library(ggspatial)
library(ggplot2)
library(stringr)
library(lubridate)
library(sf)
library(viridis)
library(ggridges)
library(gapminder)
library(forcats)
library(plotly)
df <- read_csv("Officer_Traffic_Stops.csv")
Demographics of drivers
First, look at the data using the glimpse() function from dplyr
glimpse(df) # Taking a glimpse at data
## Observations: 68,488
## Variables: 17
## $ Month_of_Stop <chr> "2017/06", "2017/06", "2017/06", "2017/…
## $ Reason_for_Stop <chr> "Vehicle Regulatory", "Vehicle Regulato…
## $ Officer_Race <chr> "White", "White", "White", "White", "Wh…
## $ Officer_Gender <chr> "Female", "Male", "Male", "Male", "Male…
## $ Officer_Years_of_Service <dbl> 3, 2, 5, 10, 2, 3, 2, 5, 25, 10, 25, 12…
## $ Driver_Race <chr> "Black", "Black", "Black", "White", "Bl…
## $ Driver_Ethnicity <chr> "Non-Hispanic", "Non-Hispanic", "Non-Hi…
## $ Driver_Gender <chr> "Male", "Male", "Male", "Male", "Female…
## $ Driver_Age <dbl> 23, 36, 45, 30, 37, 50, 50, 19, 41, 22,…
## $ Was_a_Search_Conducted <chr> "No", "No", "No", "No", "No", "No", "No…
## $ Result_of_Stop <chr> "Verbal Warning", "Verbal Warning", "Ve…
## $ CMPD_Division <chr> "North Tryon Division", "Central Divisi…
## $ ObjectID <dbl> 3001, 3002, 3003, 3004, 3005, 3006, 300…
## $ CreationDate <dttm> 2019-03-05 10:01:44, 2019-03-05 10:01:…
## $ Creator <chr> "CharlotteNC", "CharlotteNC", "Charlott…
## $ EditDate <dttm> 2019-03-05 10:01:44, 2019-03-05 10:01:…
## $ Editor <chr> "CharlotteNC", "CharlotteNC", "Charlott…
Notice the different variable types: character (chr), num (numeric), and datetime (POSIXct).
Let’s consider our target variable: Was_a_Search_Conducted.
How well balanced is the data set by this field?
A bar chart that counts the number of records by Was_a_Search_Conducted.
plot.count <- ggplot(data=df, aes(x=Was_a_Search_Conducted)) + # X and Y axeses
geom_bar(fill = "lightblue") + # Fill Color
geom_text(aes(label=..count..),stat = "count", # Label and it's position
position=position_stack(0.5))
plot.count

The data is heavily imbalanced. We have 65440 cases where search was not conducted where as there are only 3048 cases in which a search was conducted.
Next, let’s consider the age range of the driver.
Plotting a histogram of Driver_Age and identifying the appropriate bin size
plot.hist <- ggplot(data=df, aes(x=Driver_Age)) + # X and Y axeses
geom_histogram(bins = 40) # Bin size
plot.hist

plot.hist <- ggplot(data=df, aes(x=Driver_Age)) + # X and Y axeses
geom_histogram(bins = 50) # Bin size
plot.hist

plot.hist <- ggplot(data=df, aes(x=Driver_Age)) + # X and Y axeses
geom_histogram(bins = 60) # Bin size
plot.hist

Appropriate number of bins can be 50, this bin size gives a better understanding of the distribution of the driver’s age and number of stops.
The number of stops of drivers decreases as the age of the driver increases. There is a clear rise in count of stops at the the age of 21-30 years. This doesn’t necessarily means that drivers from 21 to 30 are more prone for stops based on the plot because the splike can even be because that is the age where more number of drivers are present.
A density plot of Driver_Age
plot.den <- ggplot(data=df, aes(x=Driver_Age)) + # X and Y axeses
geom_density(fill = "lightblue") + # Fill color
stat_density(adjust = 5, alpha = 0.5) # Adjust and transparency parameters
plot.den

A box plot with Was_a_Search_Conducted on the x-axis and Driver_Age on the y-axis.
plot.box <- ggplot(data=df, aes(x=Was_a_Search_Conducted, y=Driver_Age)) + geom_boxplot() # Box plot
plot.box

A violin plot with Was_a_Search_Conducted on the x-axis and Driver_Age on the y-axis.
plot.violin <- ggplot(data=df, aes(x=Was_a_Search_Conducted, y=Driver_Age)) + geom_violin() # Violin plot
plot.violin

From the above plots, it is very evident that the searches are conducted more for the drivers under the age of 30, mostly around 25.
Let’s plot the number of stops by time.
Recalling part one, the Month_of_Stop variable is a character, not a date variable. The datatime’s are simply when the data was collected; not when the stop occurred. Therefore, we’ll need to convert the Month_of_Stop variable from a character to a Date format.
Let’s first cleanup the date field using tidyverse packages like stringr and lubridate.
# Had to do a workround to make the code working for '-' by splitting the statement to mutiple steps
df <- df %>% mutate(Month_of_Stop = str_replace_all(Month_of_Stop,"/","-")) # replace "/" with "-"
df <- df %>% mutate(Month_of_Stop = paste0(df$Month_of_Stop,"-01")) # add in day
df <- df %>% mutate(Date = ymd(df$Month_of_Stop)) # created a date field
head(df)
## # A tibble: 6 x 18
## Month_of_Stop Reason_for_Stop Officer_Race Officer_Gender
## <chr> <chr> <chr> <chr>
## 1 2017-06-01 Vehicle Regula… White Female
## 2 2017-06-01 Vehicle Regula… White Male
## 3 2017-06-01 Stop Light/Sign White Male
## 4 2017-06-01 Vehicle Equipm… White Male
## 5 2017-06-01 Stop Light/Sign White Male
## 6 2017-06-01 Stop Light/Sign White Female
## # … with 14 more variables: Officer_Years_of_Service <dbl>,
## # Driver_Race <chr>, Driver_Ethnicity <chr>, Driver_Gender <chr>,
## # Driver_Age <dbl>, Was_a_Search_Conducted <chr>, Result_of_Stop <chr>,
## # CMPD_Division <chr>, ObjectID <dbl>, CreationDate <dttm>,
## # Creator <chr>, EditDate <dttm>, Editor <chr>, Date <date>
A line chart with the number of traffic stops for each month
plot.line <- df %>% count(Date) %>% ggplot(aes(x = Date, y = n)) + geom_line() + #line plot
geom_smooth(method = "auto") # trend line
plot.line
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

plot.line <- df %>% count(Date) %>% ggplot(aes(x = Date, y = n)) + geom_line() + #line plot
geom_smooth(method = "lm") # trend line
plot.line

Tried using different models to see whether there a defined pattern. Based on the trend line, there isn’t any stonng long term trend - number of traffic stops at charlotte is slightly going up. The data keeps fluctuating and gives more of a seasonality though it can’t be attributed to a specific time of the year.
Analyze data for different Reason_for_Stop
plot.facet <- df %>% count(Date, Reason_for_Stop) %>% ggplot(aes(x = Date, y = n)) + geom_line() + # line plot
geom_smooth(method = "lm") + # trend line
facet_wrap(~Reason_for_Stop) # faceting
plot.facet

?scale_x_date
Is there a problem with the plot?
The plot is very difficult to interpret because of the common scale used across all the faceted plots. The data in each of these plots are at different ranges which is making the interpretation using a common scale difficult.
To address this problem, we need to adjust the scale. Setting “free_y”/“free_x”/“free” this makes the y/x/both axis independent.
Plot the same plot but with a free y-axis scale.
plot.facet2 <- df %>% count(Date, Reason_for_Stop) %>% ggplot(aes(x = Date, y = n)) + geom_line() + # line plot
geom_smooth(method = "lm") + # trend line
facet_wrap(~Reason_for_Stop, scales = "free_y") # faceting and free y axis
plot.facet2

plot.facet2 <- df %>% count(Date, Reason_for_Stop) %>% ggplot(aes(x = Date, y = n)) + geom_line() + # line plot
geom_smooth(method = "lm") + # trend line
facet_wrap(~Reason_for_Stop, scales = "free_x") # faceting and free x axis
plot.facet2

plot.facet2 <- df %>% count(Date, Reason_for_Stop) %>% ggplot(aes(x = Date, y = n)) + geom_line() + # line plot
geom_smooth(method = "lm") + # trand line
facet_wrap(~Reason_for_Stop, scales = "free") # faceting and free x and y axeses
plot.facet2

Based on the plots, speeding seems to be the most volatile one. The number of stops keep swinging back and forth almost every month.
Since the y-axis scale is not uniform, comparisons between the different ‘Reason for Stop’ is not straightforward. Though some of the plots shows heavy upward on downward trend, the change rate can’t be really compared against another category because of the different scale.
Small multiples tends to be less effective when each of the variables are on different scales or magnitudes.
Let’s consider instead CMPD traffic stops but by CMPD division. These are more even spread by division than the type of stop.
Analyze stops by Date and Count
plot.div <- df %>% count(Date, CMPD_Division) %>% ggplot(aes(x = Date, y = n)) + # no: of stops by division
geom_line() + # line plot
geom_smooth(method = "auto") + # trend line
facet_wrap(~CMPD_Division) # faceting and free y axis
plot.div
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

plot.div <- df %>% count(Date, CMPD_Division) %>% ggplot(aes(x = Date, y = n)) + # no: of stops by division
geom_line() + # line plot
geom_smooth(method = "lm") + # trend line
facet_wrap(~CMPD_Division) # faceting and free y axis
plot.div

The number stops in North Tryon Division seems to have a very sudden spike in the beggining of 2017. North Division also shows the same behaviour. But overall trend in North Tryon Division is upward where as North Division is slightly trending low.
Number of stops in Metro Division and Freedom Division have been consistently low compared to other divisions
Number of stops in Western Division and South Division is having a higher rate of reduction when it comes to long term trends.
This doesn’t help tell us where these areas are. For that, let’s use a shape file to create a chloropleth of stops by division.
For this, we’ll create a cholorpleth for the number of police stops by police division. To do this, we need to use the sf package.
cmpd <- st_read("./CMPD_Police_Divisions/CMPD_Police_Divisions.shp") #load CMPD diisions
## Reading layer `CMPD_Police_Divisions' from data source `/Volumes/GoogleDrive/My Drive/Visual Analytics/ProbSet04/content/post/CMPD_Police_Divisions/CMPD_Police_Divisions.shp' using driver `ESRI Shapefile'
## Simple feature collection with 15 features and 12 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -81.05795 ymin: 35.00218 xmax: -80.55006 ymax: 35.52533
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
Note that while we have five files, we only load in the shapefile (.shp) file. This is typical but know that to use this file you would need the other four files in the same folder as your shapefile.
Plot CMPD Divisions
cmpd <- cmpd %>%
mutate(Name = as.character(DNAME)) %>% # new variable Name
mutate(Name = str_replace_all(Name, " Division","")) # remove string 'Division' from Name
g <- ggplot(cmpd) +
geom_sf(aes(fill = DNAME)) + # fill with division name
theme_bw() + # dark-on-light ggplot2 theme
theme(legend.position = "none") + # no legends
geom_sf_text(aes(label = Name),size = 1.5) # adjust label size
g # displaying the plot
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may
## not give correct results for longitude/latitude data

Now let’s change the projection type
g <- g + coord_sf(crs = st_crs(102003)) # specifying coordinate ref system
g

coord_sf allows to override the default projection. The new map is generated with coordinate reference system ESRI:102003 which uses Albers Equal Area Conic projection. The difference is visible from the oreantation of grid lines / latitudes and longitudes.
Chloropleth
Now, let’s create a chloropleth.
cmpd_chloropleth <- cmpd %>%
mutate(CMPD_Division = as.character(DNAME)) %>%
inner_join(count(df, CMPD_Division, Date), by = "CMPD_Division") %>%
mutate(Year = lubridate::year(Date)) %>%
ggplot() +
geom_sf(aes(fill = n)) +
scale_fill_viridis("Traffic Stops", labels = scales::comma) +
labs(title = "CMPD Traffic stops by CMPD Division",
caption = "Source: CMPD") +
annotation_scale(location = "bl", width_hint = 0.2) +
facet_wrap(~Year) +
theme_bw() +
theme(legend.position = "bottom",
plot.title = element_text(face = "bold", size = rel(1.5)),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank())
cmpd_chloropleth

ggsave(cmpd_chloropleth, filename = "cmpd_chloropleth.pdf",
width = 7, height = 5, units = "in")
ggsave(cmpd_chloropleth, filename = "cmpd_chloropleth.png",
width = 7, height = 5, units = "in")
Plot with ggridges
df.ridge <- count(df, Reason_for_Stop, Date) %>% # count based on Reason_for_Stop
mutate(Year = lubridate::year(Date))
ggplot(df.ridge, aes(Reason_for_Stop, Year, height = n, group = Year, fill = Reason_for_Stop)) + # ridge plot
geom_ridgeline_gradient() +
scale_fill_viridis(discrete = TRUE, direction = -1,alpha = 0.5) + # charateristics for scale parameters
theme(axis.text.x=element_blank())

df.ridge2 <- count(df, Reason_for_Stop, CMPD_Division) # count based on Reason_for_Stop, CMPD_Division
ggplot(df.ridge2, aes(x = n, y = CMPD_Division, fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_viridis(name = "Number of Stops", option='C') + # charateristics for scale parameters
labs(title = 'CMPD Stops')
## Picking joint bandwidth of 211

Intro to interactivity
An interactive version of one of the plots we discussed earlier
#
my_cool_plot <- cmpd %>%
mutate(CMPD_Division = as.character(DNAME)) %>%
inner_join(count(df, CMPD_Division, Date), by = "CMPD_Division") %>%
mutate(Year = lubridate::year(Date)) %>%
ggplot() +
geom_sf(aes(fill = n)) +
scale_fill_viridis("Traffic Stops", labels = scales::comma) +
labs(title = "CMPD Traffic stops by CMPD Division",
caption = "Source: CMPD") +
annotation_scale(location = "bl", width_hint = 0.2) +
facet_wrap(~Year) +
theme_bw() +
theme(legend.position = "bottom",
plot.title = element_text(face = "bold", size = rel(1.5)),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank())
#
my_cool_plot

#
ggplotly(my_cool_plot)
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomScaleBar() has yet to be implemented in plotly.
## If you'd like to see this geom implemented,
## Please open an issue with your example code at
## https://github.com/ropensci/plotly/issues
## Warning in geom2trace.default(dots[[1L]][[2L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomScaleBar() has yet to be implemented in plotly.
## If you'd like to see this geom implemented,
## Please open an issue with your example code at
## https://github.com/ropensci/plotly/issues